TEXT CATEGORIZATION: Building a kNN classifier for the Reuters-21578 collection

Author

  • Anita Krishnakumar
Abstract

Categorizing texts into topical categories has attracted growing interest over the past few years. The rapid growth of online information has created a need for tools that help find, filter, and manage high-dimensional data. Building a text classifier by hand is time-consuming and costly, so automated text categorization has gained importance. A general inductive process automatically builds a classifier by learning the characteristics of one or more categories from a set of previously classified documents. In this project we review the main approaches that have been taken to text categorization and use the k-nearest neighbour (kNN) algorithm to build a classifier for the Reuters-21578 collection.
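As a rough illustration of the kNN idea named in the abstract, the sketch below classifies a document by letting its k most similar training documents vote on the label. The abstract does not state which term weighting or similarity measure the project uses, so TF-IDF weights and cosine similarity are assumptions here, as are all function names. The six training sentences are invented toy data, not documents from the Reuters-21578 collection.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()

def fit_idf(docs_tokens):
    # Smoothed inverse document frequency computed from the training documents.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(tokens, idf):
    # Sparse TF-IDF vector, L2-normalized; terms unseen in training are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    # Dot product of two unit-length sparse vectors.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query, train_vecs, train_labels, idf, k=3):
    # Rank training documents by similarity to the query and keep the top k.
    q = vectorize(tokenize(query), idf)
    scored = [(cosine(q, v), y) for v, y in zip(train_vecs, train_labels)]
    top = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    # Similarity-weighted vote among the k neighbours.
    votes = defaultdict(float)
    for sim, label in top:
        votes[label] += sim
    return max(votes, key=votes.get)

# Invented toy training set (hypothetical, for illustration only).
train = [
    ("wheat harvest exports rise", "grain"),
    ("corn and wheat prices fall", "grain"),
    ("grain shipment to port delayed", "grain"),
    ("crude oil prices climb", "crude"),
    ("opec cuts crude oil output", "crude"),
    ("oil refinery output falls", "crude"),
]
train_tokens = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = fit_idf(train_tokens)
train_vecs = [vectorize(tokens, idf) for tokens in train_tokens]

print(knn_predict("wheat exports fall", train_vecs, train_labels, idf))
print(knn_predict("opec oil output", train_vecs, train_labels, idf))
```

Because kNN stores the training vectors and defers all work to query time, the cost of each prediction grows with the size of the training set; a real run on Reuters-21578 would also need a choice of k, a train/test split, and handling of documents with several topics, none of which the abstract specifies.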

Similar resources

An kNN Model-Based Approach and Its Application in Text Categorization

An investigation has been conducted on two well known similarity-based learning approaches to text categorization. This includes the k-nearest neighbor (kNN) classifier and the Rocchio classifier. After identifying the weakness and strength of each technique, we propose a new classifier called the kNN model-based classifier by unifying the strengths of k-NN and Rocchio classifier and adapting t...


Using kNN Model-based Approach for Automatic Text Categorization

An investigation has been conducted on two well known similarity-based learning approaches to text categorization: the k-nearest neighbor (k-NN) classifier and the Rocchio classifier. After identifying the weakness and strength of each technique, a new classifier called the kNN model-based classifier (kNNModel) has been proposed. It combines the strength of both k-NN and Rocchio. A text categor...


Experiments with multi-label text classifier on the Reuters collection

Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic. Here, documents are assigned to leaf-level categories of a category tree (called a taxonomy). The algorithm applies an iterative learning module that allow...


Text categorization on Reuters corpus

1 Task. The task of text categorization can be described as follows: given a set of documents, we want to assign to each document one or more text categories or no category. In this term project, we want to categorize documents from the well-known Reuters-21578 corpus, which is a collection of 21578 articles published on Reuters in 1987. We have chosen only the three most frequent text categories as the...


Rough set based hybrid algorithm for text classification

Automatic classification of text documents, one of the essential techniques for Web mining, has always been a hot topic due to the explosive growth of digital documents available online. In the text classification community, k-nearest neighbor (kNN) is a simple yet effective classifier. However, being a lazy learning method without premodelling, kNN has a high cost to classify new documents whe...


Publication date: 2006